Chapter 3 Abstract: Ecological niche can drive the genomic diversification of closely related bacterial species. Here, I apply genomic analyses to the genus Psychrobacter, known for its particularly wide ecological distribution. I performed a pan-genome analysis, microbial pan-genome wide association, ancestral character estimations, and analysis of selection on protein sequences of 85 strains of Psychrobacter to clarify the interactions between isolation source and Psychrobacter genome evolution. There is some evidence that Psychrobacter isolation source correlates with available gene pool; each isolation source - from warm-bodied hosts such as mammals or birds, other hosts like fish or invertebrates, to food processing, marine, or terrestrial environments - is correlated with unique genes from different COG categories. Strains isolated from invertebrate and fish hosts, as well as food processing environments, have higher numbers of total genes, though fewer unique genes than strains from mammals, birds, or marine environments. After accounting for population structure, however, very few genes correlate with isolation source; there is some evidence for increased horizontal gene transfer in strains isolated from fish and invertebrates, and evidence for increased biofilm formation in strains isolated from food-processing environments. Ancestral character estimation supports that Psychrobacter are descended from a warm host-associated bacterium. Finally, there is no correlation between isolation source and cold-adaptation of proteins. Overall, my results show that isolation source has some impact on Psychrobacter genome evolution, though the population structure of the genus makes it difficult to disentangle its effects.
The working directory was changed to /Users/dwelter/ownCloud/Psychrobacter/Data_Analysis/PANX/thesis inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the the working directory for notebook chunks.
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Attaching package: ‘seqinr’
The following object is masked from ‘package:dplyr’:
count
Attaching package: ‘reshape2’
The following object is masked from ‘package:tidyr’:
smiths
Loading required package: ape
Attaching package: ‘ape’
The following objects are masked from ‘package:seqinr’:
as.alignment, consensus
The following object is masked from ‘package:ggpubr’:
rotate
Loading required package: maps
Loading required package: microseq
Loading required package: tibble
Loading required package: data.table
data.table 1.13.0 using 1 threads (see ?getDTthreads). Latest news: r-datatable.com
**********
This installation of data.table has not detected OpenMP support. It should still work but in single-threaded mode.
This is a Mac. Please read https://mac.r-project.org/openmp/. Please engage with Apple and ask them for support. Check r-datatable.com for updates, and our Mac instructions here: https://github.com/Rdatatable/data.table/wiki/Installation. After several years of many reports of installation problems on Mac, it's time to gingerly point out that there have been no similar problems on Windows or Linux.
**********
Attaching package: ‘data.table’
The following objects are masked from ‘package:reshape2’:
dcast, melt
The following objects are masked from ‘package:dplyr’:
between, first, last
Loading required package: rlang
Attaching package: ‘rlang’
The following object is masked from ‘package:data.table’:
:=
Attaching package: ‘microseq’
The following object is masked from ‘package:ape’:
muscle
The following object is masked from ‘package:seqinr’:
translate
The following object is masked from ‘package:base’:
gregexpr
Loading required package: igraph
Attaching package: ‘igraph’
The following object is masked from ‘package:rlang’:
is_named
The following object is masked from ‘package:tibble’:
as_data_frame
The following objects are masked from ‘package:ape’:
edges, mst, ring
The following object is masked from ‘package:tidyr’:
crossing
The following objects are masked from ‘package:dplyr’:
as_data_frame, groups, union
The following objects are masked from ‘package:stats’:
decompose, spectrum
The following object is masked from ‘package:base’:
union
corrplot 0.84 loaded
Loading required package: adegenet
Loading required package: ade4
Registered S3 method overwritten by 'spdep':
method from
plot.mst ape
/// adegenet 2.1.3 is loaded ////////////
> overview: '?adegenet'
> tutorials/doc/questions: 'adegenetWeb()'
> bug reports/feature requests: adegenetIssues()
Attaching package: ‘adegenet’
The following object is masked from ‘package:rlang’:
chr
Registered S3 method overwritten by 'pryr':
method from
print.bytes Rcpp
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] treeWAS_1.0 adegenet_2.1.3 ade4_1.7-15 corrplot_0.84
[5] micropan_2.1 igraph_1.2.5 microseq_2.1.2 rlang_0.4.7
[9] data.table_1.13.0 tibble_3.0.3 phytools_0.7-47 maps_3.3.0
[13] ape_5.4-1 reshape2_1.4.4 seqinr_3.6-1 ggsignif_0.6.0
[17] ggpubr_0.4.0 tidyr_1.1.2 dplyr_1.0.2 ggplot2_3.3.2
[21] stringr_1.4.0
loaded via a namespace (and not attached):
[1] colorspace_1.4-1 pryr_0.1.4 deldir_0.1-28
[4] ellipsis_0.3.1 class_7.3-17 rio_0.5.16
[7] htmlTable_2.0.1 base64enc_0.1-3 rstudioapi_0.11
[10] codetools_0.2-16 splines_3.6.3 mnormt_2.0.2
[13] knitr_1.29 Formula_1.2-3 broom_0.7.0
[16] cluster_2.1.0 png_0.1-7 shiny_1.5.0
[19] compiler_3.6.3 backports_1.1.9 Matrix_1.2-18
[22] fastmap_1.0.1 later_1.1.0.1 htmltools_0.5.0
[25] tools_3.6.3 coda_0.19-3 gtable_0.3.0
[28] glue_1.4.2 clusterGeneration_1.3.4 gmodels_2.18.1
[31] fastmatch_1.1-0 Rcpp_1.0.5 carData_3.0-4
[34] cellranger_1.1.0 raster_3.3-13 vctrs_0.3.4
[37] spdep_1.1-5 gdata_2.18.0 nlme_3.1-149
[40] xfun_0.16 openxlsx_4.1.5 mime_0.9
[43] lifecycle_0.2.0 phangorn_2.5.5 gtools_3.8.2
[46] rstatix_0.6.0 LearnBayes_2.15.1 MASS_7.3-52
[49] scales_1.1.1 hms_0.5.3 promises_1.1.1
[52] parallel_3.6.3 expm_0.999-5 RColorBrewer_1.1-2
[55] animation_2.6 curl_4.3 gridExtra_2.3
[58] rpart_4.1-15 latticeExtra_0.6-29 stringi_1.4.6
[61] plotrix_3.7-8 checkmate_2.0.0 e1071_1.7-3
[64] permute_0.9-5 boot_1.3-25 zip_2.1.1
[67] spData_0.3.8 pkgconfig_2.0.3 lattice_0.20-41
[70] purrr_0.3.4 sf_0.9-5 htmlwidgets_1.5.1
[73] tidyselect_1.1.0 plyr_1.8.6 magrittr_1.5
[76] R6_2.4.1 generics_0.0.2 Hmisc_4.4-1
[79] combinat_0.0-8 DBI_1.1.0 pillar_1.4.6
[82] haven_2.3.1 foreign_0.8-75 withr_2.2.0
[85] mgcv_1.8-33 units_0.6-7 nnet_7.3-14
[88] survival_3.2-3 scatterplot3d_0.3-41 abind_1.4-5
[91] sp_1.4-2 crayon_1.3.4 car_3.0-9
[94] KernSmooth_2.23-17 tmvnsim_1.0-2 jpeg_0.1-8.1
[97] grid_3.6.3 readxl_1.3.1 vegan_2.5-6
[100] forcats_0.5.0 digest_0.6.25 classInt_0.4-3
[103] xtable_1.8-4 httpuv_1.5.4 numDeriv_2016.8-1.1
[106] munsell_0.5.0 beeswarm_0.2.3 quadprog_1.5-8
sessionInfo()
genematrix = psych_gpa[,10:ncol(psych_gpa)] #or whichever column your isolate data starts at
rownames(genematrix) = psych_gpa$geneCluster
genematrix = apply(genematrix, 2, function(x) gsub("^$|^ $", NA, x)) #replace empty cells with NA
transform_func = function(x) { if (is.na(x) > 0) { r = 0 } else { r = 1 } } #creates a function to recognize when a cell has information in it (in this case, when an isolate HAS a gene) and replace that cell's contents with "1". otherwise, replace that cell's contents with "0"
genematrix_binary <- apply(genematrix, 2, function(x){sapply(x, transform_func)}) #loop the function created in the step above through our transposed gene matrix
genematrix_binary_t = t(genematrix_binary) #transpose the matrix; need strains to be on the rows and genes to be on the columns
genematrix_binary_t = as.data.frame(genematrix_binary_t)
genematrix_binary_t = genematrix_binary_t %>%
mutate_if(is.factor, as.numeric) #t() will transform your data into the "matrix" format, and treewas wants it in a dataframe format, so this transforms it back
rownames(genematrix_binary_t) = colnames(genematrix_binary)
#to produce output file "genematrix_for_treewas_fixeddashes.txt"
genes_mat = genematrix
genes_mat_numbered = genes_mat
for(i in 1:nrow(genes_mat)) {
for (j in 1:ncol(genes_mat)) {
cell = genes_mat[i,j]
num = str_count(cell, ",") + 1
if (is.na(cell) == TRUE) {num = 0}
genes_mat_numbered[i,j] = num
}
}
rownames(genes_mat_numbered) = psych_gpa$geneCluster
genes_mat_numbered = as.data.frame(genes_mat_numbered)
genes_mat_numbered = genes_mat_numbered %>%
mutate_all(as.character) %>%
mutate_all(as.numeric)
library(micropan)
genes_mat_forpan = t(genes_mat_numbered)
colnames(genes_mat_forpan) = rownames(genes_mat)
rare = rarefaction(pan.matrix = genes_mat_forpan, n.perm = 999)
2 / 999
3 / 999
4 / 999
5 / 999
6 / 999
7 / 999
8 / 999
9 / 999
10 / 999
11 / 999
12 / 999
13 / 999
14 / 999
15 / 999
16 / 999
17 / 999
18 / 999
19 / 999
20 / 999
21 / 999
22 / 999
23 / 999
24 / 999
25 / 999
26 / 999
27 / 999
28 / 999
29 / 999
30 / 999
31 / 999
32 / 999
33 / 999
34 / 999
35 / 999
36 / 999
37 / 999
38 / 999
39 / 999
40 / 999
41 / 999
42 / 999
43 / 999
44 / 999
45 / 999
46 / 999
47 / 999
48 / 999
49 / 999
50 / 999
51 / 999
52 / 999
53 / 999
54 / 999
55 / 999
56 / 999
57 / 999
58 / 999
59 / 999
60 / 999
61 / 999
62 / 999
63 / 999
64 / 999
65 / 999
66 / 999
67 / 999
68 / 999
69 / 999
70 / 999
71 / 999
72 / 999
73 / 999
74 / 999
75 / 999
76 / 999
77 / 999
78 / 999
79 / 999
80 / 999
81 / 999
82 / 999
83 / 999
84 / 999
85 / 999
86 / 999
87 / 999
88 / 999
89 / 999
90 / 999
91 / 999
92 / 999
93 / 999
94 / 999
95 / 999
96 / 999
97 / 999
98 / 999
99 / 999
100 / 999
101 / 999
102 / 999
103 / 999
104 / 999
105 / 999
106 / 999
107 / 999
108 / 999
109 / 999
110 / 999
111 / 999
112 / 999
113 / 999
114 / 999
115 / 999
116 / 999
117 / 999
118 / 999
119 / 999
120 / 999
121 / 999
122 / 999
123 / 999
124 / 999
125 / 999
126 / 999
127 / 999
128 / 999
129 / 999
130 / 999
131 / 999
132 / 999
133 / 999
134 / 999
135 / 999
136 / 999
137 / 999
138 / 999
139 / 999
140 / 999
141 / 999
142 / 999
143 / 999
144 / 999
145 / 999
146 / 999
147 / 999
148 / 999
149 / 999
150 / 999
151 / 999
152 / 999
153 / 999
154 / 999
155 / 999
156 / 999
157 / 999
158 / 999
159 / 999
160 / 999
161 / 999
162 / 999
163 / 999
164 / 999
165 / 999
166 / 999
167 / 999
168 / 999
169 / 999
170 / 999
171 / 999
172 / 999
173 / 999
174 / 999
175 / 999
176 / 999
177 / 999
178 / 999
179 / 999
180 / 999
181 / 999
182 / 999
183 / 999
184 / 999
185 / 999
186 / 999
187 / 999
188 / 999
189 / 999
190 / 999
191 / 999
192 / 999
193 / 999
194 / 999
195 / 999
196 / 999
197 / 999
198 / 999
199 / 999
200 / 999
201 / 999
202 / 999
203 / 999
204 / 999
205 / 999
206 / 999
207 / 999
208 / 999
209 / 999
210 / 999
211 / 999
212 / 999
213 / 999
214 / 999
215 / 999
216 / 999
217 / 999
218 / 999
219 / 999
220 / 999
221 / 999
222 / 999
223 / 999
224 / 999
225 / 999
226 / 999
227 / 999
228 / 999
229 / 999
230 / 999
231 / 999
232 / 999
233 / 999
234 / 999
235 / 999
236 / 999
237 / 999
238 / 999
239 / 999
240 / 999
241 / 999
242 / 999
243 / 999
244 / 999
245 / 999
246 / 999
247 / 999
248 / 999
249 / 999
250 / 999
251 / 999
252 / 999
253 / 999
254 / 999
255 / 999
256 / 999
257 / 999
258 / 999
259 / 999
260 / 999
261 / 999
262 / 999
263 / 999
264 / 999
265 / 999
266 / 999
267 / 999
268 / 999
269 / 999
270 / 999
271 / 999
272 / 999
273 / 999
274 / 999
275 / 999
276 / 999
277 / 999
278 / 999
279 / 999
280 / 999
281 / 999
282 / 999
283 / 999
284 / 999
285 / 999
286 / 999
287 / 999
288 / 999
289 / 999
290 / 999
291 / 999
292 / 999
293 / 999
294 / 999
295 / 999
296 / 999
297 / 999
298 / 999
299 / 999
300 / 999
301 / 999
302 / 999
303 / 999
304 / 999
305 / 999
306 / 999
307 / 999
308 / 999
309 / 999
310 / 999
311 / 999
312 / 999
313 / 999
314 / 999
315 / 999
316 / 999
317 / 999
318 / 999
319 / 999
320 / 999
321 / 999
322 / 999
323 / 999
324 / 999
325 / 999
326 / 999
327 / 999
328 / 999
329 / 999
330 / 999
331 / 999
332 / 999
333 / 999
334 / 999
335 / 999
336 / 999
337 / 999
338 / 999
339 / 999
340 / 999
341 / 999
342 / 999
343 / 999
344 / 999
345 / 999
346 / 999
347 / 999
348 / 999
349 / 999
350 / 999
351 / 999
352 / 999
353 / 999
354 / 999
355 / 999
356 / 999
357 / 999
358 / 999
359 / 999
360 / 999
361 / 999
362 / 999
363 / 999
364 / 999
365 / 999
366 / 999
367 / 999
368 / 999
369 / 999
370 / 999
371 / 999
372 / 999
373 / 999
374 / 999
375 / 999
376 / 999
377 / 999
378 / 999
379 / 999
380 / 999
381 / 999
382 / 999
383 / 999
384 / 999
385 / 999
386 / 999
387 / 999
388 / 999
389 / 999
390 / 999
391 / 999
392 / 999
393 / 999
394 / 999
395 / 999
396 / 999
397 / 999
398 / 999
399 / 999
400 / 999
401 / 999
402 / 999
403 / 999
404 / 999
405 / 999
406 / 999
407 / 999
408 / 999
409 / 999
410 / 999
411 / 999
412 / 999
413 / 999
414 / 999
415 / 999
416 / 999
417 / 999
418 / 999
419 / 999
420 / 999
421 / 999
422 / 999
423 / 999
424 / 999
425 / 999
426 / 999
427 / 999
428 / 999
429 / 999
430 / 999
431 / 999
432 / 999
433 / 999
434 / 999
435 / 999
436 / 999
437 / 999
438 / 999
439 / 999
440 / 999
441 / 999
442 / 999
443 / 999
444 / 999
445 / 999
446 / 999
447 / 999
448 / 999
449 / 999
450 / 999
451 / 999
452 / 999
453 / 999
454 / 999
455 / 999
456 / 999
457 / 999
458 / 999
459 / 999
460 / 999
461 / 999
462 / 999
463 / 999
464 / 999
465 / 999
466 / 999
467 / 999
468 / 999
469 / 999
470 / 999
471 / 999
472 / 999
473 / 999
474 / 999
475 / 999
476 / 999
477 / 999
478 / 999
479 / 999
480 / 999
481 / 999
482 / 999
483 / 999
484 / 999
485 / 999
486 / 999
487 / 999
488 / 999
489 / 999
490 / 999
491 / 999
492 / 999
493 / 999
494 / 999
495 / 999
496 / 999
497 / 999
498 / 999
499 / 999
500 / 999
501 / 999
502 / 999
503 / 999
504 / 999
505 / 999
506 / 999
507 / 999
508 / 999
509 / 999
510 / 999
511 / 999
512 / 999
513 / 999
514 / 999
515 / 999
516 / 999
517 / 999
518 / 999
519 / 999
520 / 999
521 / 999
522 / 999
523 / 999
524 / 999
525 / 999
526 / 999
527 / 999
528 / 999
529 / 999
530 / 999
531 / 999
532 / 999
533 / 999
534 / 999
535 / 999
536 / 999
537 / 999
538 / 999
539 / 999
540 / 999
541 / 999
542 / 999
543 / 999
544 / 999
545 / 999
546 / 999
547 / 999
548 / 999
549 / 999
550 / 999
551 / 999
552 / 999
553 / 999
554 / 999
555 / 999
556 / 999
557 / 999
558 / 999
559 / 999
560 / 999
561 / 999
562 / 999
563 / 999
564 / 999
565 / 999
566 / 999
567 / 999
568 / 999
569 / 999
570 / 999
571 / 999
572 / 999
573 / 999
574 / 999
575 / 999
576 / 999
577 / 999
578 / 999
579 / 999
580 / 999
581 / 999
582 / 999
583 / 999
584 / 999
585 / 999
586 / 999
587 / 999
588 / 999
589 / 999
590 / 999
591 / 999
592 / 999
593 / 999
594 / 999
595 / 999
596 / 999
597 / 999
598 / 999
599 / 999
600 / 999
601 / 999
602 / 999
603 / 999
604 / 999
605 / 999
606 / 999
607 / 999
608 / 999
609 / 999
610 / 999
611 / 999
612 / 999
613 / 999
614 / 999
615 / 999
616 / 999
617 / 999
618 / 999
619 / 999
620 / 999
621 / 999
622 / 999
623 / 999
624 / 999
625 / 999
626 / 999
627 / 999
628 / 999
629 / 999
630 / 999
631 / 999
632 / 999
633 / 999
634 / 999
635 / 999
636 / 999
637 / 999
638 / 999
639 / 999
640 / 999
641 / 999
642 / 999
643 / 999
644 / 999
645 / 999
646 / 999
647 / 999
648 / 999
649 / 999
650 / 999
651 / 999
652 / 999
653 / 999
654 / 999
655 / 999
656 / 999
657 / 999
658 / 999
659 / 999
660 / 999
661 / 999
662 / 999
663 / 999
664 / 999
665 / 999
666 / 999
667 / 999
668 / 999
669 / 999
670 / 999
671 / 999
672 / 999
673 / 999
674 / 999
675 / 999
676 / 999
677 / 999
678 / 999
679 / 999
680 / 999
681 / 999
682 / 999
683 / 999
684 / 999
685 / 999
686 / 999
687 / 999
688 / 999
689 / 999
690 / 999
691 / 999
692 / 999
693 / 999
694 / 999
695 / 999
696 / 999
697 / 999
698 / 999
699 / 999
700 / 999
701 / 999
702 / 999
703 / 999
704 / 999
705 / 999
706 / 999
707 / 999
708 / 999
709 / 999
710 / 999
711 / 999
712 / 999
713 / 999
714 / 999
715 / 999
716 / 999
717 / 999
718 / 999
719 / 999
720 / 999
721 / 999
722 / 999
723 / 999
724 / 999
725 / 999
726 / 999
727 / 999
728 / 999
729 / 999
730 / 999
731 / 999
732 / 999
733 / 999
734 / 999
735 / 999
736 / 999
737 / 999
738 / 999
739 / 999
740 / 999
741 / 999
742 / 999
743 / 999
744 / 999
745 / 999
746 / 999
747 / 999
748 / 999
749 / 999
750 / 999
751 / 999
752 / 999
753 / 999
754 / 999
755 / 999
756 / 999
757 / 999
758 / 999
759 / 999
760 / 999
761 / 999
762 / 999
763 / 999
764 / 999
765 / 999
766 / 999
767 / 999
768 / 999
769 / 999
770 / 999
771 / 999
772 / 999
773 / 999
774 / 999
775 / 999
776 / 999
777 / 999
778 / 999
779 / 999
780 / 999
781 / 999
782 / 999
783 / 999
784 / 999
785 / 999
786 / 999
787 / 999
788 / 999
789 / 999
790 / 999
791 / 999
792 / 999
793 / 999
794 / 999
795 / 999
796 / 999
797 / 999
798 / 999
799 / 999
800 / 999
801 / 999
802 / 999
803 / 999
804 / 999
805 / 999
806 / 999
807 / 999
808 / 999
809 / 999
810 / 999
811 / 999
812 / 999
813 / 999
814 / 999
815 / 999
816 / 999
817 / 999
818 / 999
819 / 999
820 / 999
821 / 999
822 / 999
823 / 999
824 / 999
825 / 999
826 / 999
827 / 999
828 / 999
829 / 999
830 / 999
831 / 999
832 / 999
833 / 999
834 / 999
835 / 999
836 / 999
837 / 999
838 / 999
839 / 999
840 / 999
841 / 999
842 / 999
843 / 999
844 / 999
845 / 999
846 / 999
847 / 999
848 / 999
849 / 999
850 / 999
851 / 999
852 / 999
853 / 999
854 / 999
855 / 999
856 / 999
857 / 999
858 / 999
859 / 999
860 / 999
861 / 999
862 / 999
863 / 999
864 / 999
865 / 999
866 / 999
867 / 999
868 / 999
869 / 999
870 / 999
871 / 999
872 / 999
873 / 999
874 / 999
875 / 999
876 / 999
877 / 999
878 / 999
879 / 999
880 / 999
881 / 999
882 / 999
883 / 999
884 / 999
885 / 999
886 / 999
887 / 999
888 / 999
889 / 999
890 / 999
891 / 999
892 / 999
893 / 999
894 / 999
895 / 999
896 / 999
897 / 999
898 / 999
899 / 999
900 / 999
901 / 999
902 / 999
903 / 999
904 / 999
905 / 999
906 / 999
907 / 999
908 / 999
909 / 999
910 / 999
911 / 999
912 / 999
913 / 999
914 / 999
915 / 999
916 / 999
917 / 999
918 / 999
919 / 999
920 / 999
921 / 999
922 / 999
923 / 999
924 / 999
925 / 999
926 / 999
927 / 999
928 / 999
929 / 999
930 / 999
931 / 999
932 / 999
933 / 999
934 / 999
935 / 999
936 / 999
937 / 999
938 / 999
939 / 999
940 / 999
941 / 999
942 / 999
943 / 999
944 / 999
945 / 999
946 / 999
947 / 999
948 / 999
949 / 999
950 / 999
951 / 999
952 / 999
953 / 999
954 / 999
955 / 999
956 / 999
957 / 999
958 / 999
959 / 999
960 / 999
961 / 999
962 / 999
963 / 999
964 / 999
965 / 999
966 / 999
967 / 999
968 / 999
969 / 999
970 / 999
971 / 999
972 / 999
973 / 999
974 / 999
975 / 999
976 / 999
977 / 999
978 / 999
979 / 999
980 / 999
981 / 999
982 / 999
983 / 999
984 / 999
985 / 999
986 / 999
987 / 999
988 / 999
989 / 999
990 / 999
991 / 999
992 / 999
993 / 999
994 / 999
995 / 999
996 / 999
997 / 999
998 / 999
999 / 999
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ...
rare.melt = reshape2::melt(rare, id = c("Genome"))
(panrarefaction = ggplot(data = rare.melt, aes(x = Genome, y = value)) +
geom_boxplot(aes(x = Genome, group = Genome), outlier.shape = NA) +
theme_classic() +
xlab("Number of Genomes") + ylab("Number of Novel Orthologs"))
heaps(pan.matrix = genes_mat_forpan, n.perm = 999)
1 / 999
2 / 999
3 / 999
4 / 999
5 / 999
6 / 999
7 / 999
8 / 999
9 / 999
10 / 999
11 / 999
12 / 999
13 / 999
14 / 999
15 / 999
16 / 999
17 / 999
18 / 999
19 / 999
20 / 999
21 / 999
22 / 999
23 / 999
24 / 999
25 / 999
26 / 999
27 / 999
28 / 999
29 / 999
30 / 999
31 / 999
32 / 999
33 / 999
34 / 999
35 / 999
36 / 999
37 / 999
38 / 999
39 / 999
40 / 999
41 / 999
42 / 999
43 / 999
44 / 999
45 / 999
46 / 999
47 / 999
48 / 999
49 / 999
50 / 999
51 / 999
52 / 999
53 / 999
54 / 999
55 / 999
56 / 999
57 / 999
58 / 999
59 / 999
60 / 999
61 / 999
62 / 999
63 / 999
64 / 999
65 / 999
66 / 999
67 / 999
68 / 999
69 / 999
70 / 999
71 / 999
72 / 999
73 / 999
74 / 999
75 / 999
76 / 999
77 / 999
78 / 999
79 / 999
80 / 999
81 / 999
82 / 999
83 / 999
84 / 999
85 / 999
86 / 999
87 / 999
88 / 999
89 / 999
90 / 999
91 / 999
92 / 999
93 / 999
94 / 999
95 / 999
96 / 999
97 / 999
98 / 999
99 / 999
100 / 999
101 / 999
102 / 999
103 / 999
104 / 999
105 / 999
106 / 999
107 / 999
108 / 999
109 / 999
110 / 999
111 / 999
112 / 999
113 / 999
114 / 999
115 / 999
116 / 999
117 / 999
118 / 999
119 / 999
120 / 999
121 / 999
122 / 999
123 / 999
124 / 999
125 / 999
126 / 999
127 / 999
128 / 999
129 / 999
130 / 999
131 / 999
132 / 999
133 / 999
134 / 999
135 / 999
136 / 999
137 / 999
138 / 999
139 / 999
140 / 999
141 / 999
142 / 999
143 / 999
144 / 999
145 / 999
146 / 999
147 / 999
148 / 999
149 / 999
150 / 999
151 / 999
152 / 999
153 / 999
154 / 999
155 / 999
156 / 999
157 / 999
158 / 999
159 / 999
160 / 999
161 / 999
162 / 999
163 / 999
164 / 999
165 / 999
166 / 999
167 / 999
168 / 999
169 / 999
170 / 999
171 / 999
172 / 999
173 / 999
174 / 999
175 / 999
176 / 999
177 / 999
178 / 999
179 / 999
180 / 999
181 / 999
182 / 999
183 / 999
184 / 999
185 / 999
186 / 999
187 / 999
188 / 999
189 / 999
190 / 999
191 / 999
192 / 999
193 / 999
194 / 999
195 / 999
196 / 999
197 / 999
198 / 999
199 / 999
200 / 999
201 / 999
202 / 999
203 / 999
204 / 999
205 / 999
206 / 999
207 / 999
208 / 999
209 / 999
210 / 999
211 / 999
212 / 999
213 / 999
214 / 999
215 / 999
216 / 999
217 / 999
218 / 999
219 / 999
220 / 999
221 / 999
222 / 999
223 / 999
224 / 999
225 / 999
226 / 999
227 / 999
228 / 999
229 / 999
230 / 999
231 / 999
232 / 999
233 / 999
234 / 999
235 / 999
236 / 999
237 / 999
238 / 999
239 / 999
240 / 999
241 / 999
242 / 999
243 / 999
244 / 999
245 / 999
246 / 999
247 / 999
248 / 999
249 / 999
250 / 999
251 / 999
252 / 999
253 / 999
254 / 999
255 / 999
256 / 999
257 / 999
258 / 999
259 / 999
260 / 999
261 / 999
262 / 999
263 / 999
264 / 999
265 / 999
266 / 999
267 / 999
268 / 999
269 / 999
270 / 999
271 / 999
272 / 999
273 / 999
274 / 999
275 / 999
276 / 999
277 / 999
278 / 999
279 / 999
280 / 999
281 / 999
282 / 999
283 / 999
284 / 999
285 / 999
286 / 999
287 / 999
288 / 999
289 / 999
290 / 999
291 / 999
292 / 999
293 / 999
294 / 999
295 / 999
296 / 999
297 / 999
298 / 999
299 / 999
300 / 999
301 / 999
302 / 999
303 / 999
304 / 999
305 / 999
306 / 999
307 / 999
308 / 999
309 / 999
310 / 999
311 / 999
312 / 999
313 / 999
314 / 999
315 / 999
316 / 999
317 / 999
318 / 999
319 / 999
320 / 999
321 / 999
322 / 999
323 / 999
324 / 999
325 / 999
326 / 999
327 / 999
328 / 999
329 / 999
330 / 999
331 / 999
332 / 999
333 / 999
334 / 999
335 / 999
336 / 999
337 / 999
338 / 999
339 / 999
340 / 999
341 / 999
342 / 999
343 / 999
344 / 999
345 / 999
346 / 999
347 / 999
348 / 999
349 / 999
350 / 999
351 / 999
352 / 999
353 / 999
354 / 999
355 / 999
356 / 999
357 / 999
358 / 999
359 / 999
360 / 999
361 / 999
362 / 999
363 / 999
364 / 999
365 / 999
366 / 999
367 / 999
368 / 999
369 / 999
370 / 999
371 / 999
372 / 999
373 / 999
374 / 999
375 / 999
376 / 999
377 / 999
378 / 999
379 / 999
380 / 999
381 / 999
382 / 999
383 / 999
384 / 999
385 / 999
386 / 999
387 / 999
388 / 999
389 / 999
390 / 999
391 / 999
392 / 999
393 / 999
394 / 999
395 / 999
396 / 999
397 / 999
398 / 999
399 / 999
400 / 999
401 / 999
402 / 999
403 / 999
404 / 999
405 / 999
406 / 999
407 / 999
408 / 999
409 / 999
410 / 999
411 / 999
412 / 999
413 / 999
414 / 999
415 / 999
416 / 999
417 / 999
418 / 999
419 / 999
420 / 999
421 / 999
422 / 999
423 / 999
424 / 999
425 / 999
426 / 999
427 / 999
428 / 999
429 / 999
430 / 999
431 / 999
432 / 999
433 / 999
434 / 999
435 / 999
436 / 999
437 / 999
438 / 999
439 / 999
440 / 999
441 / 999
442 / 999
443 / 999
444 / 999
445 / 999
446 / 999
447 / 999
448 / 999
449 / 999
450 / 999
451 / 999
452 / 999
453 / 999
454 / 999
455 / 999
456 / 999
457 / 999
458 / 999
459 / 999
460 / 999
461 / 999
462 / 999
463 / 999
464 / 999
465 / 999
466 / 999
467 / 999
468 / 999
469 / 999
470 / 999
471 / 999
472 / 999
473 / 999
474 / 999
475 / 999
476 / 999
477 / 999
478 / 999
479 / 999
480 / 999
481 / 999
482 / 999
483 / 999
484 / 999
485 / 999
486 / 999
487 / 999
488 / 999
489 / 999
490 / 999
491 / 999
492 / 999
493 / 999
494 / 999
495 / 999
496 / 999
497 / 999
498 / 999
499 / 999
500 / 999
501 / 999
502 / 999
503 / 999
504 / 999
505 / 999
506 / 999
507 / 999
508 / 999
509 / 999
510 / 999
511 / 999
512 / 999
513 / 999
514 / 999
515 / 999
516 / 999
517 / 999
518 / 999
519 / 999
520 / 999
521 / 999
522 / 999
523 / 999
524 / 999
525 / 999
526 / 999
527 / 999
528 / 999
529 / 999
530 / 999
531 / 999
532 / 999
533 / 999
534 / 999
535 / 999
536 / 999
537 / 999
538 / 999
539 / 999
540 / 999
541 / 999
542 / 999
543 / 999
544 / 999
545 / 999
546 / 999
547 / 999
548 / 999
549 / 999
550 / 999
551 / 999
552 / 999
553 / 999
554 / 999
555 / 999
556 / 999
557 / 999
558 / 999
559 / 999
560 / 999
561 / 999
562 / 999
563 / 999
564 / 999
565 / 999
566 / 999
567 / 999
568 / 999
569 / 999
570 / 999
571 / 999
572 / 999
573 / 999
574 / 999
575 / 999
576 / 999
577 / 999
578 / 999
579 / 999
580 / 999
581 / 999
582 / 999
583 / 999
584 / 999
585 / 999
586 / 999
587 / 999
588 / 999
589 / 999
590 / 999
591 / 999
592 / 999
593 / 999
594 / 999
595 / 999
596 / 999
597 / 999
598 / 999
599 / 999
600 / 999
601 / 999
602 / 999
603 / 999
604 / 999
605 / 999
606 / 999
607 / 999
608 / 999
609 / 999
610 / 999
611 / 999
612 / 999
613 / 999
614 / 999
615 / 999
616 / 999
617 / 999
618 / 999
619 / 999
620 / 999
621 / 999
622 / 999
623 / 999
624 / 999
625 / 999
626 / 999
627 / 999
628 / 999
629 / 999
630 / 999
631 / 999
632 / 999
633 / 999
634 / 999
635 / 999
636 / 999
637 / 999
638 / 999
639 / 999
640 / 999
641 / 999
642 / 999
643 / 999
644 / 999
645 / 999
646 / 999
647 / 999
648 / 999
649 / 999
650 / 999
651 / 999
652 / 999
653 / 999
654 / 999
655 / 999
656 / 999
657 / 999
658 / 999
659 / 999
660 / 999
661 / 999
662 / 999
663 / 999
664 / 999
665 / 999
666 / 999
667 / 999
668 / 999
669 / 999
670 / 999
671 / 999
672 / 999
673 / 999
674 / 999
675 / 999
676 / 999
677 / 999
678 / 999
679 / 999
680 / 999
681 / 999
682 / 999
683 / 999
684 / 999
685 / 999
686 / 999
687 / 999
688 / 999
689 / 999
690 / 999
691 / 999
692 / 999
693 / 999
694 / 999
695 / 999
696 / 999
697 / 999
698 / 999
699 / 999
700 / 999
701 / 999
702 / 999
703 / 999
704 / 999
705 / 999
706 / 999
707 / 999
708 / 999
709 / 999
710 / 999
711 / 999
712 / 999
713 / 999
714 / 999
715 / 999
716 / 999
717 / 999
718 / 999
719 / 999
720 / 999
721 / 999
722 / 999
723 / 999
724 / 999
725 / 999
726 / 999
727 / 999
728 / 999
729 / 999
730 / 999
731 / 999
732 / 999
733 / 999
734 / 999
735 / 999
736 / 999
737 / 999
738 / 999
739 / 999
740 / 999
741 / 999
742 / 999
743 / 999
744 / 999
745 / 999
746 / 999
747 / 999
748 / 999
749 / 999
750 / 999
751 / 999
752 / 999
753 / 999
754 / 999
755 / 999
756 / 999
757 / 999
758 / 999
759 / 999
760 / 999
761 / 999
762 / 999
763 / 999
764 / 999
765 / 999
766 / 999
767 / 999
768 / 999
769 / 999
770 / 999
771 / 999
772 / 999
773 / 999
774 / 999
775 / 999
776 / 999
777 / 999
778 / 999
779 / 999
780 / 999
781 / 999
782 / 999
783 / 999
784 / 999
785 / 999
786 / 999
787 / 999
788 / 999
789 / 999
790 / 999
791 / 999
792 / 999
793 / 999
794 / 999
795 / 999
796 / 999
797 / 999
798 / 999
799 / 999
800 / 999
801 / 999
802 / 999
803 / 999
804 / 999
805 / 999
806 / 999
807 / 999
808 / 999
809 / 999
810 / 999
811 / 999
812 / 999
813 / 999
814 / 999
815 / 999
816 / 999
817 / 999
818 / 999
819 / 999
820 / 999
821 / 999
822 / 999
823 / 999
824 / 999
825 / 999
826 / 999
827 / 999
828 / 999
829 / 999
830 / 999
831 / 999
832 / 999
833 / 999
834 / 999
835 / 999
836 / 999
837 / 999
838 / 999
839 / 999
840 / 999
841 / 999
842 / 999
843 / 999
844 / 999
845 / 999
846 / 999
847 / 999
848 / 999
849 / 999
850 / 999
851 / 999
852 / 999
853 / 999
854 / 999
855 / 999
856 / 999
857 / 999
858 / 999
859 / 999
860 / 999
861 / 999
862 / 999
863 / 999
864 / 999
865 / 999
866 / 999
867 / 999
868 / 999
869 / 999
870 / 999
871 / 999
872 / 999
873 / 999
874 / 999
875 / 999
876 / 999
877 / 999
878 / 999
879 / 999
880 / 999
881 / 999
882 / 999
883 / 999
884 / 999
885 / 999
886 / 999
887 / 999
888 / 999
889 / 999
890 / 999
891 / 999
892 / 999
893 / 999
894 / 999
895 / 999
896 / 999
897 / 999
898 / 999
899 / 999
900 / 999
901 / 999
902 / 999
903 / 999
904 / 999
905 / 999
906 / 999
907 / 999
908 / 999
909 / 999
910 / 999
911 / 999
912 / 999
913 / 999
914 / 999
915 / 999
916 / 999
917 / 999
918 / 999
919 / 999
920 / 999
921 / 999
922 / 999
923 / 999
924 / 999
925 / 999
926 / 999
927 / 999
928 / 999
929 / 999
930 / 999
931 / 999
932 / 999
933 / 999
934 / 999
935 / 999
936 / 999
937 / 999
938 / 999
939 / 999
940 / 999
941 / 999
942 / 999
943 / 999
944 / 999
945 / 999
946 / 999
947 / 999
948 / 999
949 / 999
950 / 999
951 / 999
952 / 999
953 / 999
954 / 999
955 / 999
956 / 999
957 / 999
958 / 999
959 / 999
960 / 999
961 / 999
962 / 999
963 / 999
964 / 999
965 / 999
966 / 999
967 / 999
968 / 999
969 / 999
970 / 999
971 / 999
972 / 999
973 / 999
974 / 999
975 / 999
976 / 999
977 / 999
978 / 999
979 / 999
980 / 999
981 / 999
982 / 999
983 / 999
984 / 999
985 / 999
986 / 999
987 / 999
988 / 999
989 / 999
990 / 999
991 / 999
992 / 999
993 / 999
994 / 999
995 / 999
996 / 999
997 / 999
998 / 999
999 / 999
Intercept alpha
579.6761870 0.4551001
warm = psych_strain_collection %>% filter(ISME_source == "warm host") %>% select(strainID) %>% unlist() %>% unname()
other = psych_strain_collection %>% filter(ISME_source == "other host") %>% select(strainID) %>% unlist() %>% unname()
food = psych_strain_collection %>% filter(ISME_source == "food") %>% select(strainID) %>% unlist() %>% unname()
marine = psych_strain_collection %>% filter(ISME_source == "marine") %>% select(strainID) %>% unlist() %>% unname()
terrestrial = psych_strain_collection %>% filter(ISME_source == "terrestrial") %>% select(strainID) %>% unlist() %>% unname()
allseqsum = rowSums(genes_mat_numbered)
names(allseqsum) = rownames(genes_mat)
allisosum = rowSums(genematrix_binary)
names(allisosum) = rownames(genes_mat)
pangenome_by_isolation_summary = rbind(allseqsum, allisosum)
pangenome_by_isolation_summary = t(pangenome_by_isolation_summary)
pangenome_by_isolation_summary = as.data.frame(pangenome_by_isolation_summary)
pangenome_by_isolation_summary$geneCluster = rownames(pangenome_by_isolation_summary)
pangenome_by_isolation_summary$overall_pan_cat = NA
pangenome_by_isolation_summary$overall_pan_cat[pangenome_by_isolation_summary$allisosum >= 84] <- "core"
pangenome_by_isolation_summary$overall_pan_cat[pangenome_by_isolation_summary$allisosum <= 83 & pangenome_by_isolation_summary$allisosum >= 77] <- "soft core"
pangenome_by_isolation_summary$overall_pan_cat[pangenome_by_isolation_summary$allisosum <= 76 & pangenome_by_isolation_summary$allisosum >= 2 ] <- "shell"
pangenome_by_isolation_summary$overall_pan_cat[pangenome_by_isolation_summary$allisosum == 1 ] <- "cloud"
pangenome_by_isolation_summary$overall_pan_cat = factor(pangenome_by_isolation_summary$overall_pan_cat,
levels = c("core", "soft core", "shell", "cloud"))
(pancategory_count = ggplot(data = pangenome_by_isolation_summary, aes(x = overall_pan_cat)) +
geom_bar(stat = 'count') +
theme_classic()
)
ggarrange(pancategory_count, panrarefaction, align = 'h', widths = c(1, 2))
COG_multi_cats = multicog_geneclusters %>% filter(COG2 != "")
#just realized that this does not have every COG cat for every strain, for example there is no COG CAT "B" for frigidicola_056...
#will have to make a new data frame to populate from scratch
COG_multi_cats_separated = data.frame()
for(i in 1:nrow(COG_multi_cats)) {
gen = COG_multi_cats[i,1]
cat = COG_multi_cats[i,2]
COG1 = COG_multi_cats[i,3]
dummy1 = c(gen, cat, COG1)
COG_multi_cats_separated = rbind.data.frame(COG_multi_cats_separated, as.character(dummy1), stringsAsFactors = F)
COG2 = COG_multi_cats[i,4]
dummy2 = c(gen, cat, COG2)
COG_multi_cats_separated = rbind.data.frame(COG_multi_cats_separated, as.character(dummy2), stringsAsFactors = F)
COG3 = COG_multi_cats[i,5]
if (COG3 != ""){
dummy3 = c(gen, cat, COG3)
COG_multi_cats_separated = rbind.data.frame(COG_multi_cats_separated, as.character(dummy3), stringsAsFactors = F)
}
COG4 = COG_multi_cats[i,6]
if (COG4 != ""){
dummy4 = c(gen, cat, COG4)
COG_multi_cats_separated = rbind.data.frame(COG_multi_cats_separated, as.character(dummy4), stringsAsFactors = F)
}
}
colnames(COG_multi_cats_separated) = c("geneCluster", "pan_category", "COG_category")
COG_single_cats = multicog_geneclusters %>% filter(COG2 == "") %>%
select(geneCluster, pan_category, COG1)
colnames(COG_single_cats)[3] = "COG_category"
multiCOG_genes_corrected = rbind(COG_multi_cats_separated, COG_single_cats)
multiCOG_genes_corrected$pan_category = factor(multiCOG_genes_corrected$pan_category,
levels = c("core", "softcore", "shell", "cloud"))
multiCOG_genes_corrected$COG_category =
factor(multiCOG_genes_corrected$COG_category, levels = c("D", "M", "N", "O", "T", "U", "V", "W", "Z",
"A", "B","J", "K", "L","C", "E", "F", "G", "H", "I", "P", "Q",
"S", "X"))
COGcols = c("#72300C", "#93074D", "#D60D69", "#ED5FA6", "#CE1B13", "#A50505", "#E5530A", "#E2770C", "#F9A138",
"#F7F072", "#F7D239", "#CCF43B", "#5BB73D", "#057F28",
"#05541A", "#20998B", "#60B1F4", "#1458F2", "#5141AF", "#1F048E", "#310872", "#520775",
"#9F9FA0", "#39383A")
(function_pancat = ggplot(data = multiCOG_genes_corrected,
aes(x = pan_category, fill = COG_category)) +
geom_bar(stat = 'count', position = 'fill') +
scale_fill_manual(values = COGcols) +
theme_classic())
COG_pancat_chi = chisq.test(t(table(multiCOG_genes_corrected$pan_category, multiCOG_genes_corrected$COG_category)))
Chi-squared approximation may be incorrect
COG_pancat_chi
Pearson's Chi-squared test
data: t(table(multiCOG_genes_corrected$pan_category, multiCOG_genes_corrected$COG_category))
X-squared = 2748, df = 69, p-value < 2.2e-16
COG_pancat_cor = corrplot(COG_pancat_chi$residuals, is.cor = FALSE)
#add function_pancat and COG_pancat_cor together in Illustrator
psych_PCAT = left_join(psych_PCAT, pangenome_by_isolation_summary[,3:4])
Joining, by = "geneCluster"
test = left_join(psych_PCAT[,c(2:4,33)], psych_strain_collection[,c(3,27)])
Joining, by = "strainID"
test2 = as.data.frame(table(test$strainID, test$overall_pan_cat))
colnames(test2) = c("strainID", "pan_cat", "frequency")
test2 = left_join(test2, psych_strain_collection[,c(3,27)])
Joining, by = "strainID"
number_genes_perstrain = as.data.frame(table(psych_PCAT$strainID))
colnames(number_genes_perstrain) = c("strainID", "num_genes")
test2 = left_join(test2, number_genes_perstrain)
Joining, by = "strainID"
test2$ISME_source[test2$strainID == "P_sp_IAM12030_72-O-c"] <- "terrestrial"
agg_pancat_byiso_bygenome = aggregate(test2$frequency, by = list(test2$ISME_source, test2$pan_cat), FUN = mean)
colnames(agg_pancat_byiso_bygenome) = c("isolation_source", "pan_cat", "average_geneclusters_pergenome")
agg_pancat_byiso_bygenome$pan_cat = factor(agg_pancat_byiso_bygenome$pan_cat, levels = c("core", "soft core", "shell", "cloud"))
agg_pancat_byiso_bygenome_sp = spread(agg_pancat_byiso_bygenome, key = pan_cat, value = average_geneclusters_pergenome)
chisq_pan_iso = chisq.test(agg_pancat_byiso_bygenome_sp[1:5,2:5])
corrplot(chisq_pan_iso$residuals, is.corr = F)
aggregate(test2$frequency, by = list(test2$pan_cat), FUN = mean)
Group.1 x
1 core 1148.48235
2 soft core 449.64706
3 shell 1004.58824
4 cloud 70.36471
pancat_byiso_bygenome = as.data.frame(table(psych_PCAT$strainID, psych_PCAT$overall_pan_cat))
colnames(pancat_byiso_bygenome) = c("strainID", "pan_cat", "frequency")
pancat_byiso_bygenome = left_join(pancat_byiso_bygenome, psych_strain_collection[,c(3,27)])
Joining, by = "strainID"
pancat_byiso_bygenome = left_join(pancat_byiso_bygenome, number_genes_perstrain)
Joining, by = "strainID"
pancat_byiso_bygenome$ISME_source = factor(pancat_byiso_bygenome$ISME_source,
levels = c("warm host", "other host", "food", "marine", "terrestrial"))
pancat_byiso_bygenome$pan_cat = factor(pancat_byiso_bygenome$pan_cat,
levels = c("core", "soft core", "shell", "cloud"))
pancat_byiso_bygenome$ISME_source[pancat_byiso_bygenome$strainID == "P_sp_IAM12030_72-O-c"] <- "terrestrial"
pancat_byiso_bygenome_sp = spread(pancat_byiso_bygenome, key = "pan_cat", value = "frequency")
pancat_byiso_bygenome_sp$core_prop = pancat_byiso_bygenome_sp$core/pancat_byiso_bygenome_sp$num_genes
pancat_byiso_bygenome_sp$softcore_prop = pancat_byiso_bygenome_sp$`soft core`/pancat_byiso_bygenome_sp$num_genes
pancat_byiso_bygenome_sp$shell_prop = pancat_byiso_bygenome_sp$shell/pancat_byiso_bygenome_sp$num_genes
pancat_byiso_bygenome_sp$cloud_prop = pancat_byiso_bygenome_sp$cloud/pancat_byiso_bygenome_sp$num_genes
pancat_byiso_bygenome_ga = pancat_byiso_bygenome_sp %>%
gather(core_prop, softcore_prop, shell_prop, cloud_prop, key = "pan_cat", value = "proportion") %>%
select(-core, -`soft core`, -shell, -cloud)
pancat_byiso_bygenome_ga$pan_cat = str_remove(pancat_byiso_bygenome_ga$pan_cat, "_prop")
pancat_byiso_bygenome_ga$pan_cat = factor(pancat_byiso_bygenome_ga$pan_cat,
levels = c("core", "softcore", "shell", "cloud"))
(genenum_iso = ggplot(data = pancat_byiso_bygenome_ga, aes(x = ISME_source, y= num_genes, color = ISME_source)) +
geom_boxplot(alpha = 0) +
geom_point(position=position_dodge(width=0.75),aes(group=ISME_source)) +
scale_color_manual(values = ISMEcolors) +
#facet_grid(.~pan_cat, scales = "free") +
theme_classic())
(pan_str_iso = ggplot(data = pancat_byiso_bygenome_ga, aes(x = ISME_source, y= proportion, color = ISME_source)) +
geom_boxplot(alpha = 0) +
geom_point(position=position_dodge(width=0.75),aes(group=ISME_source)) +
scale_color_manual(values = ISMEcolors) +
facet_grid(.~pan_cat, scales = "free") +
theme_classic())
stats_core = pancat_byiso_bygenome_ga %>% filter(pan_cat == 'core')
kruskal.test(proportion ~ ISME_source, data = stats_core)
Kruskal-Wallis rank sum test
data: proportion by ISME_source
Kruskal-Wallis chi-squared = 18.442, df = 4, p-value = 0.001011
pairwise.wilcox.test(x = stats_core$proportion, g = stats_core$ISME_source, p.adjust.method = 'BH')
Pairwise comparisons using Wilcoxon rank sum test
data: stats_core$proportion and stats_core$ISME_source
warm host other host food marine
other host 0.013 - - -
food 0.013 0.918 - -
marine 0.738 0.022 0.023 -
terrestrial 0.918 0.013 0.014 0.745
P value adjustment method: BH
aggregate(stats_core$proportion, by = list(stats_core$ISME_source), FUN = median)
Group.1 x
1 warm host 0.4424822
2 other host 0.4151218
3 food 0.4139804
4 marine 0.4430722
5 terrestrial 0.4381671
stats_softcore = pancat_byiso_bygenome_ga %>% filter(pan_cat == 'softcore')
kruskal.test(proportion ~ ISME_source, data = stats_softcore)
Kruskal-Wallis rank sum test
data: proportion by ISME_source
Kruskal-Wallis chi-squared = 12.696, df = 4, p-value = 0.01286
pairwise.wilcox.test(x = stats_softcore$proportion, g = stats_softcore$ISME_source, p.adjust.method = 'BH')
Pairwise comparisons using Wilcoxon rank sum test
data: stats_softcore$proportion and stats_softcore$ISME_source
warm host other host food marine
other host 0.386 - - -
food 0.868 0.407 - -
marine 0.143 0.144 0.094 -
terrestrial 0.065 0.065 0.033 0.868
P value adjustment method: BH
aggregate(stats_softcore$proportion, by = list(stats_softcore$ISME_source), FUN = median)
Group.1 x
1 warm host 0.1645282
2 other host 0.1670198
3 food 0.1651576
4 marine 0.1757599
5 terrestrial 0.1745093
stats_shell = pancat_byiso_bygenome_ga %>% filter(pan_cat == 'shell')
kruskal.test(proportion ~ ISME_source, data = stats_shell)
Kruskal-Wallis rank sum test
data: proportion by ISME_source
Kruskal-Wallis chi-squared = 17.722, df = 4, p-value = 0.001398
pairwise.wilcox.test(x = stats_shell$proportion, g = stats_shell$ISME_source, p.adjust.method = 'BH')
cannot compute exact p-value with ties
Pairwise comparisons using Wilcoxon rank sum test
data: stats_shell$proportion and stats_shell$ISME_source
warm host other host food marine
other host 0.023 - - -
food 0.035 0.984 - -
marine 0.984 0.013 0.034 -
terrestrial 0.984 0.020 0.034 0.984
P value adjustment method: BH
aggregate(stats_shell$proportion, by = list(stats_shell$ISME_source), FUN = median)
Group.1 x
1 warm host 0.3662383
2 other host 0.4024303
3 food 0.3940601
4 marine 0.3621622
5 terrestrial 0.3669571
stats_cloud = pancat_byiso_bygenome_ga %>% filter(pan_cat == 'cloud')
kruskal.test(proportion ~ ISME_source, data = stats_cloud)
Kruskal-Wallis rank sum test
data: proportion by ISME_source
Kruskal-Wallis chi-squared = 7.691, df = 4, p-value = 0.1036
pairwise.wilcox.test(x = stats_cloud$proportion, g = stats_cloud$ISME_source, p.adjust.method = 'BH')
cannot compute exact p-value with tiescannot compute exact p-value with tiescannot compute exact p-value with tiescannot compute exact p-value with ties
Pairwise comparisons using Wilcoxon rank sum test
data: stats_cloud$proportion and stats_cloud$ISME_source
warm host other host food marine
other host 0.20 - - -
food 0.87 0.20 - -
marine 0.96 0.21 0.65 -
terrestrial 0.96 0.27 0.65 0.96
P value adjustment method: BH
aggregate(stats_cloud$proportion, by = list(stats_cloud$ISME_source), FUN = median)
Group.1 x
1 warm host 0.01546329
2 other host 0.01047651
3 food 0.02462527
4 marine 0.01698842
5 terrestrial 0.02118343
warm = psych_strain_collection %>% filter(ISME_source == "warm host") %>% select(strainID) %>% unlist() %>% unname()
other = psych_strain_collection %>% filter(ISME_source == "other host") %>% select(strainID) %>% unlist() %>% unname()
food = psych_strain_collection %>% filter(ISME_source == "food") %>% select(strainID) %>% unlist() %>% unname()
marine = psych_strain_collection %>% filter(ISME_source == "marine") %>% select(strainID) %>% unlist() %>% unname()
terrestrial = psych_strain_collection %>% filter(ISME_source == "terrestrial") %>% select(strainID) %>% unlist() %>% unname()
warmseqsum = rowSums(genes_mat_numbered[,colnames(genes_mat_numbered) %in% warm])
names(warmseqsum) = rownames(genes_mat)
warmisosum = rowSums(genematrix_binary[,colnames(genematrix_binary) %in% warm])
names(warmisosum) = rownames(genes_mat)
otherseqsum = rowSums(genes_mat_numbered[,colnames(genes_mat_numbered) %in% other])
names(otherseqsum) = rownames(genes_mat)
otherisosum = rowSums(genematrix_binary[,colnames(genematrix_binary) %in% other])
names(otherisosum) = rownames(genes_mat)
foodseqsum = rowSums(genes_mat_numbered[,colnames(genes_mat_numbered) %in% food])
names(otherseqsum) = rownames(genes_mat)
foodisosum = rowSums(genematrix_binary[,colnames(genematrix_binary) %in% food])
names(foodisosum) = rownames(genes_mat)
marineseqsum = rowSums(genes_mat_numbered[,colnames(genes_mat_numbered) %in% marine])
names(marineseqsum) = rownames(genes_mat)
marineisosum = rowSums(genematrix_binary[,colnames(genematrix_binary) %in% marine])
names(marineisosum) = rownames(genes_mat)
terrestrialseqsum = rowSums(genes_mat_numbered[,colnames(genes_mat_numbered) %in% terrestrial])
names(terrestrialseqsum) = rownames(genes_mat)
terrestrialisosum = rowSums(genematrix_binary[,colnames(genematrix_binary) %in% terrestrial])
names(terrestrialisosum) = rownames(genes_mat)
allseqsum = rowSums(genes_mat_numbered)
names(allseqsum) = rownames(genes_mat)
allisosum = rowSums(genematrix_binary)
names(allisosum) = rownames(genes_mat)
pan_str_iso = rbind(allseqsum, allisosum, warmseqsum, warmisosum, otherseqsum, otherisosum,
foodseqsum, foodisosum, marineseqsum, marineisosum, terrestrialseqsum, terrestrialisosum)
pan_str_iso = t(pan_str_iso)
pan_str_iso = as.data.frame(pan_str_iso)
pan_str_iso$geneCluster = rownames(pan_str_iso)
pan_str_iso$overall_pan_cat = NA
pan_str_iso$overall_pan_cat[pan_str_iso$allisosum >= 84] <- "core"
pan_str_iso$overall_pan_cat[pan_str_iso$allisosum <= 83 & pan_str_iso$allisosum >= 77] <- "soft core"
pan_str_iso$overall_pan_cat[pan_str_iso$allisosum <= 82 & pan_str_iso$allisosum >= 2 ] <- "shell"
pan_str_iso$overall_pan_cat[pan_str_iso$allisosum == 1 ] <- "cloud"
pan_str_iso$warm_pan_cat = NA
pan_str_iso$warm_pan_cat[pan_str_iso$warmisosum >= 13] <- "core"
pan_str_iso$warm_pan_cat[pan_str_iso$warmisosum <= 12 & pan_str_iso$warmisosum >= 2 ] <- "shell"
pan_str_iso$warm_pan_cat[pan_str_iso$warmisosum == 1 ] <- "cloud"
pan_str_iso$warm_pan_cat[pan_str_iso$warmisosum == 0 ] <- "absent"
pan_str_iso$other_pan_cat = NA
pan_str_iso$other_pan_cat[pan_str_iso$otherisosum >= 20] <- "core"
pan_str_iso$other_pan_cat[pan_str_iso$otherisosum <= 19 & pan_str_iso$otherisosum >= 2 ] <- "shell"
pan_str_iso$other_pan_cat[pan_str_iso$otherisosum == 1 ] <- "cloud"
pan_str_iso$other_pan_cat[pan_str_iso$otherisosum == 0 ] <- "absent"
pan_str_iso$food_pan_cat = NA
pan_str_iso$food_pan_cat[pan_str_iso$foodisosum == 11] <- "core"
pan_str_iso$food_pan_cat[pan_str_iso$foodisosum <= 10 & pan_str_iso$foodisosum >= 2 ] <- "shell"
pan_str_iso$food_pan_cat[pan_str_iso$foodisosum == 1 ] <- "cloud"
pan_str_iso$food_pan_cat[pan_str_iso$foodisosum == 0 ] <- "absent"
pan_str_iso$marine_pan_cat = NA
pan_str_iso$marine_pan_cat[pan_str_iso$marineisosum >= 19] <- "core"
pan_str_iso$marine_pan_cat[pan_str_iso$marineisosum <= 18 & pan_str_iso$marineisosum >= 2 ] <- "shell"
pan_str_iso$marine_pan_cat[pan_str_iso$marineisosum == 1 ] <- "cloud"
pan_str_iso$marine_pan_cat[pan_str_iso$marineisosum == 0 ] <- "absent"
pan_str_iso$terrestrial_pan_cat = NA
pan_str_iso$terrestrial_pan_cat[pan_str_iso$terrestrialisosum >= 15] <- "core"
pan_str_iso$terrestrial_pan_cat[pan_str_iso$terrestrialisosum <= 14 & pan_str_iso$terrestrialisosum >= 2 ] <- "shell"
pan_str_iso$terrestrial_pan_cat[pan_str_iso$terrestrialisosum == 1 ] <- "cloud"
pan_str_iso$terrestrial_pan_cat[pan_str_iso$terrestrialisosum == 0 ] <- "absent"
unique_warm = pan_str_iso %>% filter(warm_pan_cat != "absent" & other_pan_cat == "absent" & food_pan_cat == "absent" & marine_pan_cat == "absent" & terrestrial_pan_cat == "absent") %>% select(geneCluster) %>% unlist() %>% unname
unique_warm_annot = multiCOG_genes_corrected %>% filter(geneCluster %in% unique_warm)
unique_warm_annot$isolation = "warm"
unique_other = pan_str_iso %>% filter(warm_pan_cat == "absent" & other_pan_cat != "absent" & food_pan_cat == "absent" & marine_pan_cat == "absent" & terrestrial_pan_cat == "absent") %>% select(geneCluster) %>% unlist() %>% unname
unique_other_annot = multiCOG_genes_corrected %>% filter(geneCluster %in% unique_other)
unique_other_annot$isolation = "other"
unique_food = pan_str_iso %>% filter(warm_pan_cat == "absent" & other_pan_cat == "absent" & food_pan_cat != "absent" & marine_pan_cat == "absent" & terrestrial_pan_cat == "absent") %>% select(geneCluster) %>% unlist() %>% unname
unique_food_annot = multiCOG_genes_corrected %>% filter(geneCluster %in% unique_food)
unique_food_annot$isolation = "food"
unique_marine = pan_str_iso %>% filter(warm_pan_cat == "absent" & other_pan_cat == "absent" & food_pan_cat == "absent" & marine_pan_cat != "absent" & terrestrial_pan_cat == "absent") %>% select(geneCluster) %>% unlist() %>% unname
unique_marine_annot = multiCOG_genes_corrected %>% filter(geneCluster %in% unique_marine)
unique_marine_annot$isolation = "marine"
unique_terrestrial = pan_str_iso %>% filter(warm_pan_cat == "absent" & other_pan_cat == "absent" & food_pan_cat == "absent" & marine_pan_cat == "absent" & terrestrial_pan_cat != "absent") %>% select(geneCluster) %>% unlist() %>% unname
unique_terrestrial_annot = multiCOG_genes_corrected %>% filter(geneCluster %in% unique_terrestrial)
unique_terrestrial_annot$isolation = "terrestrial"
unique_genes_per_isolation = rbind(unique_warm_annot, unique_other_annot, unique_food_annot, unique_marine_annot, unique_terrestrial_annot)
unique_genes_per_isolation$isolation = factor(unique_genes_per_isolation$isolation, levels = c("warm", "other", "food", "marine", "terrestrial"))
unique_genes_per_isolation$COG_category = factor(unique_genes_per_isolation$COG_category, levels = c("D", "M", "N", "O", "T", "U", "V", "W", "Z",
"A", "B","J", "K", "L",
"C", "E", "F", "G", "H", "I", "P", "Q",
"S", "X"))
cols = c("#72300C", "#93074D", "#D60D69", "#ED5FA6", "#CE1B13", "#A50505",
"#E5530A", "#E2770C", "#F9A138", "#F7F072", "#F7D239", "#CCF43B",
"#5BB73D", "#057F28", "#05541A", "#20998B", "#60B1F4", "#1458F2",
"#5141AF", "#1F048E", "#310872", "#520775", "#9F9FA0", "#39383A")
(unique_gene_content = ggplot(data = unique_genes_per_isolation, aes(x = isolation, fill = COG_category)) +
geom_bar(stat = 'count') +
scale_fill_manual(values = cols) +
theme_classic())
unique_corr = table(unique_genes_per_isolation$isolation, unique_genes_per_isolation$COG_category)
chisq_uni_iso = chisq.test(unique_corr)
Chi-squared approximation may be incorrect
corrplot(chisq_uni_iso$residuals, is.corr = F)
#add unique_gene_content w corrplot in illustrator
unique(psych_strain_collection$ISME_source)
[1] "marine" "food" "terrestrial" "warm host" "other host"
isolationsource = psych_strain_collection[,c(3,27)]
isolationsource$dummy = 1
isolationsource = isolationsource %>%
spread(key = ISME_source, value = dummy)
isolationsource[is.na(isolationsource)] <- 0
rownames(isolationsource) = isolationsource$strainID
WH <- as.vector(unlist(isolationsource$`warm host`))
names(WH) = rownames(isolationsource)
all(names(WH) %in% rownames(genematrix_treewas))
[1] TRUE
all(rownames(genematrix_treewas) %in% names(WH))
[1] TRUE
WH_treewas <- treeWAS(snps = genematrix_treewas, phen = WH, tree = P.tree.noout, seed = 1)
[1] "treeWAS snps sim done @ 2021-03-24 10:11:57"
[1] "Reconstructions completed @ 2021-03-24 10:12:10"
[1] "Started running terminal test @ 2021-03-24 10:12:10"
[1] "Real data scores completed for terminal test @ 2021-03-24 10:12:10"
[1] "Simulated data scores completed for terminal test @ 2021-03-24 10:12:10"
[1] "Started running simultaneous test @ 2021-03-24 10:12:10"
[1] "Real data scores completed for simultaneous test @ 2021-03-24 10:12:10"
[1] "Simulated data scores completed for simultaneous test @ 2021-03-24 10:12:11"
[1] "Started running subsequent test @ 2021-03-24 10:12:11"
[1] "Real data scores completed for subsequent test @ 2021-03-24 10:12:11"
[1] "Simulated data scores completed for subsequent test @ 2021-03-24 10:12:11"
[1] "Finished running terminal test @ 2021-03-24 10:12:12"
[1] "Finished running simultaneous test @ 2021-03-24 10:12:12"
[1] "Finished running subsequent test @ 2021-03-24 10:12:12"
[1] "ID of significant loci completed @ 2021-03-24 10:12:12"
print(WH_treewas)
####################
## treeWAS output ##
####################
####################
## All findings: ##
####################
Number of significant loci: [1] 0
########################
## Findings by test: ##
########################
####################
## terminal test ##
####################
Number of significant loci:
[1] 0
Significance threshold:
99.99997%
0.8588235
########################
## simultaneous test ##
########################
Number of significant loci:
[1] 0
Significance threshold:
99.99997%
4.5
######################
## subsequent test ##
######################
Number of significant loci:
[1] 0
Significance threshold:
99.99997%
159.3333
OH <- as.vector(unlist(isolationsource$`other host`))
names(OH) = rownames(isolationsource)
all(names(OH) %in% rownames(genematrix_treewas))
[1] TRUE
all(rownames(genematrix_treewas) %in% names(OH))
[1] TRUE
OH_treewas <- treeWAS(snps = genematrix_treewas, phen = OH, tree = P.tree.noout, seed = 1)
[1] "treeWAS snps sim done @ 2021-03-24 10:12:20"
[1] "Reconstructions completed @ 2021-03-24 10:12:31"
[1] "Started running terminal test @ 2021-03-24 10:12:31"
[1] "Real data scores completed for terminal test @ 2021-03-24 10:12:31"
[1] "Simulated data scores completed for terminal test @ 2021-03-24 10:12:32"
[1] "Started running simultaneous test @ 2021-03-24 10:12:32"
[1] "Real data scores completed for simultaneous test @ 2021-03-24 10:12:32"
[1] "Simulated data scores completed for simultaneous test @ 2021-03-24 10:12:32"
[1] "Started running subsequent test @ 2021-03-24 10:12:32"
[1] "Real data scores completed for subsequent test @ 2021-03-24 10:12:33"
[1] "Simulated data scores completed for subsequent test @ 2021-03-24 10:12:33"
[1] "Finished running terminal test @ 2021-03-24 10:12:33"
[1] "Finished running simultaneous test @ 2021-03-24 10:12:33"
[1] "Finished running subsequent test @ 2021-03-24 10:12:34"
[1] "ID of significant loci completed @ 2021-03-24 10:12:34"
print(OH_treewas)
####################
## treeWAS output ##
####################
####################
## All findings: ##
####################
Number of significant loci: [1] 5
Significant loci:
[1] "GC00002129_r1_3" "GC00002231_r1_1" "GC00002376" "GC00002380_r1_2"
[5] "GC00002622"
########################
## Findings by test: ##
########################
####################
## terminal test ##
####################
Number of significant loci:
[1] 1
Significance threshold:
99.99997%
0.6235294
Significant loci:
SNP.locus p.value score G1P1 G0P0 G1P0 G0P1
GC00002622 6994 0 0.6705882 11 60 2 12
########################
## simultaneous test ##
########################
Number of significant loci:
[1] 4
Significance threshold:
99.99997%
4
Significant loci:
SNP.locus p.value score G1P1 G0P0 G1P0 G0P1
GC00002129_r1_3 5957 0 6 9 56 6 14
GC00002231_r1_1 6198 0 5 11 53 9 12
GC00002376 6521 0 5 6 52 10 17
GC00002380_r1_2 6531 0 6 10 56 6 13
######################
## subsequent test ##
######################
Number of significant loci:
[1] 0
Significance threshold:
99.99997%
138.2667
Fo <- as.vector(unlist(isolationsource$food))
names(Fo) = rownames(isolationsource)
all(names(Fo) %in% rownames(genematrix_treewas))
[1] TRUE
all(rownames(genematrix_treewas) %in% names(Fo))
[1] TRUE
Fo_treewas <- treeWAS(snps = genematrix_treewas, phen = Fo, tree = P.tree.noout, seed = 1)
[1] "treeWAS snps sim done @ 2021-03-24 10:12:41"
[1] "Reconstructions completed @ 2021-03-24 10:12:53"
[1] "Started running terminal test @ 2021-03-24 10:12:53"
[1] "Real data scores completed for terminal test @ 2021-03-24 10:12:53"
[1] "Simulated data scores completed for terminal test @ 2021-03-24 10:12:54"
[1] "Started running simultaneous test @ 2021-03-24 10:12:54"
[1] "Real data scores completed for simultaneous test @ 2021-03-24 10:12:54"
[1] "Simulated data scores completed for simultaneous test @ 2021-03-24 10:12:54"
[1] "Started running subsequent test @ 2021-03-24 10:12:54"
[1] "Real data scores completed for subsequent test @ 2021-03-24 10:12:54"
[1] "Simulated data scores completed for subsequent test @ 2021-03-24 10:12:54"
[1] "Finished running terminal test @ 2021-03-24 10:12:55"
[1] "Finished running simultaneous test @ 2021-03-24 10:12:55"
[1] "Finished running subsequent test @ 2021-03-24 10:12:55"
[1] "ID of significant loci completed @ 2021-03-24 10:12:55"
print(Fo_treewas)
####################
## treeWAS output ##
####################
####################
## All findings: ##
####################
Number of significant loci: [1] 5
Significant loci:
[1] "GC00000256_r1_r1_1_p1" "GC00002564" "GC00002565"
[4] "GC00002566" "GC00002666"
########################
## Findings by test: ##
########################
####################
## terminal test ##
####################
Number of significant loci:
[1] 0
Significance threshold:
99.99997%
0.8580392
########################
## simultaneous test ##
########################
Number of significant loci:
[1] 5
Significance threshold:
99.99997%
4.966667
Significant loci:
SNP.locus p.value score G1P1 G0P0 G1P0 G0P1
GC00000256_r1_r1_1_p1 1711 0 5 7 53 21 4
GC00002564 6889 0 5 5 67 7 6
GC00002565 6890 0 5 5 67 7 6
GC00002566 6891 0 5 5 67 7 6
GC00002666 7085 0 5 5 68 6 6
######################
## subsequent test ##
######################
Number of significant loci:
[1] 0
Significance threshold:
99.99997%
159.9889
Ma <- as.vector(unlist(isolationsource$marine))
names(Ma) = rownames(isolationsource)
all(names(Ma) %in% rownames(genematrix_treewas))
[1] TRUE
all(rownames(genematrix_treewas) %in% names(Ma))
[1] TRUE
Ma_treewas <- treeWAS(snps = genematrix_treewas, phen = Ma, tree = P.tree.noout, seed = 1)
[1] "treeWAS snps sim done @ 2021-03-24 10:13:02"
[1] "Reconstructions completed @ 2021-03-24 10:13:14"
[1] "Started running terminal test @ 2021-03-24 10:13:14"
[1] "Real data scores completed for terminal test @ 2021-03-24 10:13:14"
[1] "Simulated data scores completed for terminal test @ 2021-03-24 10:13:15"
[1] "Started running simultaneous test @ 2021-03-24 10:13:15"
[1] "Real data scores completed for simultaneous test @ 2021-03-24 10:13:15"
[1] "Simulated data scores completed for simultaneous test @ 2021-03-24 10:13:15"
[1] "Started running subsequent test @ 2021-03-24 10:13:15"
[1] "Real data scores completed for subsequent test @ 2021-03-24 10:13:15"
[1] "Simulated data scores completed for subsequent test @ 2021-03-24 10:13:15"
[1] "Finished running terminal test @ 2021-03-24 10:13:16"
[1] "Finished running simultaneous test @ 2021-03-24 10:13:16"
[1] "Finished running subsequent test @ 2021-03-24 10:13:16"
[1] "ID of significant loci completed @ 2021-03-24 10:13:16"
print(Ma_treewas)
####################
## treeWAS output ##
####################
####################
## All findings: ##
####################
Number of significant loci: [1] 4
Significant loci:
[1] "GC00001426_2" "GC00002047" "GC00002181_2" "GC00003144"
########################
## Findings by test: ##
########################
####################
## terminal test ##
####################
Number of significant loci:
[1] 2
Significance threshold:
99.99997%
0.6227451
Significant loci:
SNP.locus p.value score G1P1 G0P0 G1P0 G0P1
GC00001426_2 4122 0 -0.6235294 15 1 63 6
GC00003144 7862 0 0.6235294 6 63 1 15
########################
## simultaneous test ##
########################
Number of significant loci:
[1] 2
Significance threshold:
99.99997%
5
Significant loci:
SNP.locus p.value score G1P1 G0P0 G1P0 G0P1
GC00002047 5782 0 11 13 49 15 8
GC00002181_2 6087 0 8 11 52 12 10
######################
## subsequent test ##
######################
Number of significant loci:
[1] 0
Significance threshold:
99.99997%
143.6667
Te <- as.vector(unlist(isolationsource$terrestrial))
names(Te) = rownames(isolationsource)
all(names(Te) %in% rownames(genematrix_treewas))
[1] TRUE
all(rownames(genematrix_treewas) %in% names(Te))
[1] TRUE
Te_treewas <- treeWAS(snps = genematrix_treewas, phen = Te, tree = P.tree.noout, seed = 1)
[1] "treeWAS snps sim done @ 2021-03-24 10:13:23"
[1] "Reconstructions completed @ 2021-03-24 10:13:36"
[1] "Started running terminal test @ 2021-03-24 10:13:36"
[1] "Real data scores completed for terminal test @ 2021-03-24 10:13:36"
[1] "Simulated data scores completed for terminal test @ 2021-03-24 10:13:37"
[1] "Started running simultaneous test @ 2021-03-24 10:13:37"
[1] "Real data scores completed for simultaneous test @ 2021-03-24 10:13:37"
[1] "Simulated data scores completed for simultaneous test @ 2021-03-24 10:13:37"
[1] "Started running subsequent test @ 2021-03-24 10:13:37"
[1] "Real data scores completed for subsequent test @ 2021-03-24 10:13:38"
[1] "Simulated data scores completed for subsequent test @ 2021-03-24 10:13:38"
[1] "Finished running terminal test @ 2021-03-24 10:13:38"
[1] "Finished running simultaneous test @ 2021-03-24 10:13:38"
[1] "Finished running subsequent test @ 2021-03-24 10:13:38"
[1] "ID of significant loci completed @ 2021-03-24 10:13:38"
print(Te_treewas)
####################
## treeWAS output ##
####################
####################
## All findings: ##
####################
Number of significant loci: [1] 1
Significant loci:
[1] "GC00000111_6"
########################
## Findings by test: ##
########################
####################
## terminal test ##
####################
Number of significant loci:
[1] 0
Significance threshold:
99.99997%
0.7647059
########################
## simultaneous test ##
########################
Number of significant loci:
[1] 1
Significance threshold:
99.99997%
4.966667
Significant loci:
SNP.locus p.value score G1P1 G0P0 G1P0 G0P1
GC00000111_6 1141 0 -5 6 15 54 10
######################
## subsequent test ##
######################
Number of significant loci:
[1] 0
Significance threshold:
99.99997%
153.6667
OH_treewas_gc = OH_treewas$treeWAS.combined$treeWAS.combined %>% unlist()
OH_treewas_annotated = psych_eggnog_annotations %>% filter(geneCluster %in% OH_treewas_gc)
Fo_treewas_gc = Fo_treewas$treeWAS.combined$treeWAS.combined %>% unlist()
Fo_treewas_annotated = psych_eggnog_annotations %>% filter(geneCluster %in% Fo_treewas_gc)
Ma_treewas_gc = Ma_treewas$treeWAS.combined$treeWAS.combined %>% unlist()
Ma_treewas_annotated = psych_eggnog_annotations %>% filter(geneCluster %in% Ma_treewas_gc)
Te_treewas_gc = Te_treewas$treeWAS.combined$treeWAS.combined %>% unlist()
Te_treewas_annotated = psych_eggnog_annotations %>% filter(geneCluster %in% Te_treewas_gc)
#two state ancestral state reconstruction
ordered_df = strain_collection[match(M.tree$tip.label, strain_collection$strainID),]
rownames(ordered_df) = ordered_df$strainID
x1<-setNames(ordered_df[,21],rownames(ordered_df))
x1col = c("deepskyblue3", "orange")
fitER1 = ace(x1, M.tree, model = "ER", type = "discrete", marginal = TRUE)
fitSYM1 = ace(x1, M.tree, model = "SYM", type = "discrete", marginal = TRUE)
fitARD1 = ace(x1, M.tree, model = "ARD", type = "discrete", marginal = TRUE)
#ARD model maximizes likelihood, but does it do so significantly?
1 - pchisq(2*abs(fitARD1$loglik - fitER1$loglik), 1) #p = 0.15
[1] 0.1544003
1 - pchisq(2*abs(fitARD1$loglik - fitSYM1$loglik), 1) #p = 0.15
[1] 0.1544003
#ok, so is there a difference between ER and SYM?
1 - pchisq(2*abs(fitER1$loglik - fitSYM1$loglik), 1)
[1] 1
#no
#so I will go with the simplest ER model
plotTree(M.tree,fsize=0.8,ftype="i")
nodelabels(node=1:M.tree$Nnode+Ntip(M.tree),pie=fitER1$lik.anc,piecol = x1col,cex=0.5)
#tiplabels(pie=to.matrix(x1[M.tree$tip.label],levels(x1)),piecol = x1col,cex=0.3)
#add.simmap.legend(leg=sort(unique(x1)),colors=x1col,x=0.9*par()$usr[1],
# y=-max(nodeHeights(M.tree)),fsize=0.8)
#plotTree(M.tree,fsize=0.8,ftype="i")
tiplabels(pie=to.matrix(x1,sort(unique(x1))),piecol=x1col,cex=0.3)
add.simmap.legend(colors=x1col,prompt=FALSE,x=0.9*par()$usr[1],
y=-max(nodeHeights(M.tree)),fsize=0.8)
fitER1
Ancestral Character Estimation
Call: ace(x = x1, phy = M.tree, type = "discrete", model = "ER", marginal = TRUE)
Log-likelihood: -39.5478
Rate index matrix:
c w
c . 1
w 1 .
Parameter estimates:
rate index estimate std-err
1 1.4862 0.4245
Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
c w
0.1730664 0.8269336
psychroot = drop.tip(M.tree, c("M_boevrei","M_atlantae","M_osloensis","A_puyangensis",
"M_cuniculi","M_porci","M_pluranimalium","M_canis","M_catarrhalis","M_caviae",
"M_ovis","M_bovoculi","M_oblonga","M_equi","M_bovis","M_caprae","M_lacunata","M_nonliquefaciens"))
x1psychroot<-x1[names(x1) %in% psychroot$tip.label]
fitER_psychroot = ace(x1psychroot, psychroot, model = "ER", type = "discrete", marginal = TRUE)
fitER_psychroot
Ancestral Character Estimation
Call: ace(x = x1psychroot, phy = psychroot, type = "discrete", model = "ER",
marginal = TRUE)
Log-likelihood: -32.31518
Rate index matrix:
c w
c . 1
w 1 .
Parameter estimates:
rate index estimate std-err
1 2.7609 0.7641
Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
c w
0.4464888 0.5535112
plotTree(psychroot,fsize=0.8,ftype="i")
nodelabels(node=1:psychroot$Nnode+Ntip(psychroot),pie=fitER_psychroot$lik.anc,piecol = x1col,cex=0.5)
#tiplabels(pie=to.matrix(x1psychroot[psychroot$tip.label],levels(x1psychroot)),piecol = x1col,cex=0.3)
#add.simmap.legend(leg=sort(unique(x1psychroot)),colors=x1col,x=0.9*par()$usr[1],
# y=-max(nodeHeights(psychroot)),fsize=0.8)
#plotTree(M.tree,fsize=0.8,ftype="i")
tiplabels(pie=to.matrix(x1psychroot,sort(unique(x1psychroot))),piecol=x1col,cex=0.3)
add.simmap.legend(colors=x1col,prompt=FALSE,x=0.9*par()$usr[1],
y=-max(nodeHeights(psychroot)),fsize=0.8)
#now by more specific isolation sources
ordered_df$ISME_source = factor(ordered_df$ISME_source,
levels = c("warm host", "other host", "food", "marine", "terrestrial"))
x2<-setNames(ordered_df[,19],rownames(ordered_df))
x2col = c("#F9D503", "#5DC863", "#21908C", "#3B528B", "#440154")
fitER2 = ace(x2, M.tree, model = "ER", type = "discrete", marginal = TRUE)
fitSYM2 = ace(x2, M.tree, model = "SYM", type = "discrete", marginal = TRUE)
NaNs producedNA/Inf replaced by maximum positive valueNaNs produced
fitARD2 = ace(x2, M.tree, model = "ARD", type = "discrete", marginal = TRUE)
imaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionimaginary parts discarded in coercionNaNs producedNA/Inf replaced by maximum positive valueNaNs produced
#ARD model maximizes likelihood, but does it do so significantly?
1 - pchisq(2*abs(fitARD2$loglik - fitER2$loglik), 1) #p = 1.110223e-15
[1] 1.110223e-15
1 - pchisq(2*abs(fitARD2$loglik - fitSYM2$loglik), 1) #p = 4.260376e-05
[1] 4.260376e-05
#both significant, so
#I will go with the ARD model
1 - pchisq(2*abs(fitSYM2$loglik - fitER2$loglik), 1)
[1] 5.422884e-12
plotTree(M.tree,fsize=0.8,ftype="i")
tiplabels(pie=to.matrix(x2,sort(unique(x2))),piecol=x2col,cex=0.3)
add.simmap.legend(colors=x2col,prompt=FALSE,x=0.9*par()$usr[1],
y=-max(nodeHeights(M.tree)),fsize=0.8)
#plotTree(M.tree,fsize=0.8,ftype="i")
nodelabels(node=1:M.tree$Nnode+Ntip(M.tree),pie=fitER2$lik.anc,piecol = x2col,cex=0.5)
#tiplabels(pie=to.matrix(x2[M.tree$tip.label],levels(x2)),piecol = x2col,cex=0.3)
#add.simmap.legend(leg=sort(unique(x2)),colors=x2col,x=0.9*par()$usr[1],
# y=-max(nodeHeights(M.tree)),fsize=0.8)
x2psychroot<-x2[names(x2) %in% psychroot$tip.label]
fitER2_psychroot = ace(x2psychroot, psychroot, model = "ER", type = "discrete", marginal = TRUE)
fitER2_psychroot
Ancestral Character Estimation
Call: ace(x = x2psychroot, phy = psychroot, type = "discrete", model = "ER",
marginal = TRUE)
Log-likelihood: -126.5058
Rate index matrix:
warm host other host food marine terrestrial
warm host . 1 1 1 1
other host 1 . 1 1 1
food 1 1 . 1 1
marine 1 1 1 . 1
terrestrial 1 1 1 1 .
Parameter estimates:
rate index estimate std-err
1 14.7976 3.0498
Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
warm host other host food marine terrestrial
0.2 0.2 0.2 0.2 0.2
plotTree(psychroot,fsize=0.8,ftype="i")
nodelabels(node=1:psychroot$Nnode+Ntip(psychroot),pie=fitER2_psychroot$lik.anc,piecol = x2col,cex=0.5)
#tiplabels(pie=to.matrix(x2psychroot[psychroot$tip.label],levels(x1psychroot)),piecol = x2col,cex=0.3)
#add.simmap.legend(leg=sort(unique(x2psychroot)),colors=x2col,x=0.9*par()$usr[1],
# y=-max(nodeHeights(psychroot)),fsize=0.8)
#plotTree(M.tree,fsize=0.8,ftype="i")
tiplabels(pie=to.matrix(x2psychroot,sort(unique(x2psychroot))),piecol=x2col,cex=0.3)
add.simmap.legend(colors=x2col,prompt=FALSE,x=0.9*par()$usr[1],
y=-max(nodeHeights(psychroot)),fsize=0.8)
psych_genecluster_summ = as.data.frame(table(psych_PCAT$geneCluster))
colnames(psych_genecluster_summ) = c("geneCluster", "frequency")
highpcatfreq = as.data.frame(table(psych_highPCAT$geneCluster))
colnames(highpcatfreq) = c("geneCluster", "freq_highPCAT")
psych_genecluster_summ = left_join(psych_genecluster_summ, highpcatfreq)
Joining, by = "geneCluster"
psych_genecluster_summ$freq_highPCAT[is.na(psych_genecluster_summ$freq_highPCAT) == TRUE] <- 0
psych_genecluster_summ$perc_highPCAT = psych_genecluster_summ$freq_highPCAT/psych_genecluster_summ$frequency
psych_genecluster_summ_nosingletons = psych_genecluster_summ %>% filter(frequency != 1)
(perHPCAT = ggplot(data = psych_genecluster_summ, aes(x = perc_highPCAT)) + geom_histogram() + theme_classic() + xlab("Percentage of CDS in geneCluster that are classified as 'highly cold adaptive'"))
(perHPCAT_noS = ggplot(data = psych_genecluster_summ_nosingletons, aes(x = perc_highPCAT)) + geom_histogram() + theme_classic() + xlab("Percentage of CDS in geneCluster that are classified as 'highly cold adaptive', singletons removed"))
ggarrange(perHPCAT, perHPCAT_noS, align = 'h', labels = c("A", "B"))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
psych_PCAT_strainID_summ = as.data.frame(table(psych_PCAT$strainID))
colnames(psych_PCAT_strainID_summ) = c("strainID", "frequency")
highpcat_bystrain = as.data.frame(table(psych_highPCAT$strainID))
colnames(highpcat_bystrain) = c("strainID", "freq_highPCAT")
psych_PCAT_strainID_summ = left_join(psych_PCAT_strainID_summ, highpcat_bystrain)
Joining, by = "strainID"
psych_PCAT_strainID_summ$perc_highPCAT = psych_PCAT_strainID_summ$freq_highPCAT/psych_PCAT_strainID_summ$frequency
perc_highPCAT = psych_PCAT_strainID_summ$perc_highPCAT
names(perc_highPCAT) = psych_PCAT_strainID_summ$strainID
setdiff(names(perc_highPCAT), P.tree.noout$tip.label)
[1] "P_sp_IAM12030_72-O-c"
phylo_PCAT = phylosig(tree = P.tree.noout, x = perc_highPCAT, test = TRUE)
[1] "some species in x are missing from tree, dropping missing taxa from x"
[1] "some species in tree are missing from x , dropping missing taxa from the tree"
phylo_PCAT
Phylogenetic signal K : 0.456351
P-value (based on 1000 randomizations) : 0.001
(HPCAT_bystrain = ggplot(data = psych_PCAT_strainID_summ, aes(x = perc_highPCAT)) + geom_histogram() + theme_classic() + xlab("Percentage of highly cold adaptive proteins by strain"))
psych_genome_summaries = left_join(psych_genome_summaries, psych_strain_collection[,c(3,5,21,27)])
Joining, by = "strain_plus_code"
psych_PCAT_strainID_summ$strainID = as.character(psych_PCAT_strainID_summ$strainID)
psych_PCAT_strainID_summ$strainID[psych_PCAT_strainID_summ$strainID == "P_sp_IAM12030_72-O-c"] <- "P_sp_IAM12030-72-O-c"
psych_PCAT_strainID_summ = left_join(psych_PCAT_strainID_summ, psych_genome_summaries[,c(19,5,20,21)])
Joining, by = "strainID"
psych_PCAT_strainID_summ$temprange = "mesophile"
psych_PCAT_strainID_summ$temprange[psych_PCAT_strainID_summ$strainID == "P_frigidicola_ACAM304" |
psych_PCAT_strainID_summ$strainID == "P_frigidicola_ACAM309" |
psych_PCAT_strainID_summ$strainID == "P_sp_72-O-c" |
psych_PCAT_strainID_summ$strainID == "P_sp_IAM12030-72-O-c" |
psych_PCAT_strainID_summ$strainID == "P_urativorans_ACAM311"] <- "psychrophile"
wilcox.test(perc_highPCAT ~ temprange, data = psych_PCAT_strainID_summ)
Wilcoxon rank sum test with continuity correction
data: perc_highPCAT by temprange
W = 256, p-value = 0.2999
alternative hypothesis: true location shift is not equal to 0
kruskal.test(frequency ~ ISME_source, data = psych_PCAT_strainID_summ)
Kruskal-Wallis rank sum test
data: frequency by ISME_source
Kruskal-Wallis chi-squared = 16.388, df = 4, p-value = 0.00254
kruskal.test(Genome.size..bp. ~ ISME_source, data = psych_PCAT_strainID_summ)
Kruskal-Wallis rank sum test
data: Genome.size..bp. by ISME_source
Kruskal-Wallis chi-squared = 14.093, df = 4, p-value = 0.007005
(genes_by_isosource = ggplot(data = psych_PCAT_strainID_summ, aes(x = ISME_source, y = frequency)) +
geom_boxplot(alpha = 0, aes(color = ISME_source)) +
geom_point(aes(color = ISME_source)) +
scale_color_manual(values = ISMEcolors) +
geom_signif(comparisons = list(c("warm host", "other host"), c("warm host", "food"), c("other host", "marine"), c("other host", "terrestrial"), c("food", "marine"), c("food", "terrestrial")), map_signif_level = T) +
theme_classic() +
ylab("# of genes per genome"))
#all pairwise comparisons:
#comparisons = split(t(combn(levels(psych_PCAT_strainID_summ$ISME_source), 2)), seq(nrow(t(combn(levels(psych_PCAT_strainID_summ$ISME_source), 2)))))
pairwise.wilcox.test(x = psych_PCAT_strainID_summ$Genome.size..bp., g = psych_PCAT_strainID_summ$ISME_source, p.adjust.method = 'BH')
Pairwise comparisons using Wilcoxon rank sum test
data: psych_PCAT_strainID_summ$Genome.size..bp. and psych_PCAT_strainID_summ$ISME_source
food marine other host terrestrial
marine 0.093 - - -
other host 0.951 0.046 - -
terrestrial 0.031 0.951 0.031 -
warm host 0.031 0.951 0.031 0.951
P value adjustment method: BH
pairwise.wilcox.test(x = psych_PCAT_strainID_summ$frequency, g = psych_PCAT_strainID_summ$ISME_source, p.adjust.method = 'BH')
cannot compute exact p-value with tiescannot compute exact p-value with tiescannot compute exact p-value with tiescannot compute exact p-value with tiescannot compute exact p-value with tiescannot compute exact p-value with tiescannot compute exact p-value with ties
Pairwise comparisons using Wilcoxon rank sum test
data: psych_PCAT_strainID_summ$frequency and psych_PCAT_strainID_summ$ISME_source
food marine other host terrestrial
marine 0.046 - - -
other host 1.000 0.032 - -
terrestrial 0.034 0.913 0.032 -
warm host 0.021 0.655 0.021 0.913
P value adjustment method: BH
(HCA_by_temprange = ggplot(data = psych_PCAT_strainID_summ, aes(x = temprange, y = perc_highPCAT)) +
geom_boxplot(alpha = 0) +
theme_classic() +
ylab("% of genes that are highly cold adapted"))
psych_PCAT_strainID_summ$ISME_source = factor(psych_PCAT_strainID_summ$ISME_source, levels = c("warm host", "other host", "food", "marine", "terrestrial"))
(HCA_by_isosource = ggplot(data = psych_PCAT_strainID_summ, aes(x = ISME_source, y = perc_highPCAT)) +
geom_boxplot(alpha = 0, aes(color = ISME_source)) +
geom_point(aes(color = ISME_source)) +
scale_color_manual(values = ISMEcolors) +
theme_classic() +
ylab("% of genes that are highly cold adapted"))
kruskal.test(perc_highPCAT ~ ISME_source, data = psych_PCAT_strainID_summ)
Kruskal-Wallis rank sum test
data: perc_highPCAT by ISME_source
Kruskal-Wallis chi-squared = 8.5337, df = 4, p-value = 0.07387
ggarrange(HCA_by_temprange, HCA_by_isosource, labels = c("A", "B"), widths = c(1, 3))
highPCATlocus = psych_highPCAT %>% select(locus_tag) %>% unlist() %>% unname()
psych_PCAT$PCAT = NA
psych_PCAT$PCAT[psych_PCAT$locus_tag %in% highPCATlocus] <- "high"
psych_PCAT$PCAT[is.na(psych_PCAT$PCAT) == TRUE] <- "not"
psych_PCAT$PCAT = factor(psych_PCAT$PCAT, levels = c("not", "high"))
psych_PCAT_annot = left_join(psych_PCAT, multiCOG_genes_corrected)
Joining, by = "geneCluster"
(function_PCAT = ggplot(data = psych_PCAT_annot,
aes(x = PCAT, fill = COG_category)) +
geom_bar(stat = 'count', position = 'fill') +
scale_fill_manual(values = COGcols) +
theme_classic())
COG_PCAT_chi = chisq.test(t(table(psych_PCAT_annot$PCAT, psych_PCAT_annot$COG_category)))
COG_PCAT_chi
Pearson's Chi-squared test
data: t(table(psych_PCAT_annot$PCAT, psych_PCAT_annot$COG_category))
X-squared = 3750.4, df = 23, p-value < 2.2e-16
COG_PCAT_cor = corrplot(COG_PCAT_chi$residuals, is.cor = FALSE)
#arrange function_PCAT w corrplot in illustrator